Abstract

This paper applies a hidden Markov model to the problem of Attention Deficit 
Hyperactivity Disorder (ADHD) diagnosis from resting-state functional Magnetic 
Resonance Image (fMRI) scans of subjects. The proposed model considers the 
temporal evolution of fMRI voxel activations in the cortex, cingulate gyrus, and 
thalamus regions of the brain in order to make a diagnosis. Four feature dimensionality
reduction methods are applied to the fMRI scan: voxel means, weighted voxel
means, principal components analysis, and kernel principal components
analysis. Using principal components analysis and kernel principal components
analysis for dimensionality reduction, the proposed algorithm yielded accuracies
of 63.01% and 62.06%, respectively, on the ADHD-200 competition dataset
when differentiating between healthy control, ADHD inattentive, and ADHD
combined types.


1 Introduction 

Statistical machine learning methods have recently permeated disciplines such as psychiatry, which
specializes in the diagnosis and treatment of neuropsychiatric disorders. The availability of
large-scale functional Magnetic Resonance Image (fMRI) datasets has encouraged the application of
advanced machine learning models to the diagnosis of neuropsychiatric disorders [1]. fMRI scans
measure brain activity by detecting fluctuations in blood-oxygen levels over time. Brain activations
are represented digitally as voxels, the three-dimensional analogue of pixels.

In this paper we present the application of a temporal model, specifically a Hidden Markov
Model (HMM), to the problem of automatically diagnosing Attention Deficit Hyperactivity Disorder
(ADHD) from resting-state fMRI scans of subjects.¹ ADHD is a psychiatric disorder that
adversely affects the attention span, hyperactivity, or impulsivity of an individual. Since
ADHD-positive individuals have difficulty maintaining focus on a mental activity, it is reasonable
to assume that the temporal evolution of their brain activity, as measured by fMRI, differs from
that of healthy individuals. This argument motivates the use of a temporal model for ADHD
diagnosis.

The primary challenge in developing a machine learning algorithm for ADHD diagnosis is the large 
dimensionality of the fMRI scan. A single fMRI scan may consist of hundreds of three-dimensional 
images over time, each of which is composed of approximately 500,000 voxels. Therefore, extracting
lower dimensional fMRI representations that retain discriminative features for diagnosis is an
important step in the implementation of diagnosis algorithms. If successful, an automatic ADHD
diagnosis algorithm would aid mental-health care professionals in the diagnosis and treatment of the
disorder.

¹ A resting-state fMRI measures the brain activities of subjects that are not asked to perform a given task
during the course of the scan.

This paper is structured as follows: the following section provides a review of related work.
Section 3 describes the proposed temporal model for ADHD diagnosis and outlines the feature
extraction and feature representation techniques explored. Section 4 describes the dataset used for
training and evaluating the proposed algorithm. In addition, this section states hypotheses regarding
the performance and structure of the learned diagnostic model and describes the evaluation procedure
for testing the performance of the algorithm. The experimental results and analysis are presented in
Section 5. Finally, the work is concluded in Section 6.


2 Related Work 


Using the ADHD-200 competition dataset, which consists of several hundred resting-state fMRI
scans,² Eloyan et al. [?] explored several different classifiers for ADHD diagnosis, including a
support vector machine, gradient boosting, and voxel-based morphology. In addition, several feature
extraction methods were investigated, including singular value decomposition and CUR matrix
decomposition. The best classification accuracy was achieved by taking a weighted combination of
these classifiers, which yielded 61.0% accuracy on the test data. Also using the ADHD-200
competition dataset, Sina et al. [10] extracted histogram of oriented gradient features from fMRI scans,
which were then input to a support vector machine. The classifier yielded an accuracy of 62.6%
on the test dataset. These two methods report the highest classification accuracy on the ADHD-200
competition dataset.


Recent studies [3] emphasize that different parts of the brain are functionally correlated. Taking
this into consideration, the Human Connectome Project explores graphical models that seek
to capture these functional connectivities, both in task-based and resting-state fMRI scans.³
Similarly, Zhang et al. [14] proposed a Bayesian network for modeling functional neural activity. In this
work, each region of the brain is represented as a node in the graphical model and the functional
connectivity of these nodes over time is used for distinguishing drug addicts from healthy controls.


Apart from using functional relations, temporal correlation between brain voxels and connectivity
has been explored by Fiecas et al. [?]. Furthermore, the temporal relation between mental states and
neuronal activities has been investigated by building a conditional random field [?]. Duan et al.
proposed two methods based on likelihood and distance measures to analyze fMRI scans using an
HMM. However, this work focuses on analyzing the Blood Oxygen Level Dependent (BOLD) signal
for brain activities in order to predict the brain activations in task-based fMRI time series. Eavani et
al. [?] analyzed the functional connectivity dynamics in resting-state fMRI and decoded the temporal
variation of functional connectivity into a sequence of hidden states using an HMM.


Similar to the previously mentioned temporal approaches, we investigate
the temporal evolution of voxels for both healthy and ADHD-positive subjects using an HMM.
However, we explore reduced dimensional representations of fMRI voxels in ADHD regions of
interest. The analysis of regions of interest in an fMRI scan instead of the entire brain is common
practice. For instance, Solmaz et al. [13] used a bag-of-words approach for identifying ADHD
patients using the default mode network region of the brain.


3 Temporal ADHD Diagnosis Algorithm 

In order to learn an ADHD classification model, several steps are necessary. For each time slice of 
the fMRI, voxels are extracted from ADHD Regions of Interest (ROIs) in the brain and dimensionality
reduction algorithms are applied to the voxels in each region to reduce the dimensionality of the
data. The resulting data is presented as observations to a cluster of Hidden Markov Models (HMMs)
that learn to discriminate between healthy, ADHD inattentive, and ADHD combined types. Using
the resulting classifier, new fMRI data can be input to the system; features are extracted from the
fMRI data, which the classifier uses to diagnose the subject. This process is depicted in Figure 1.
Each step of the process is explained in more detail in the following sections.

² http://fcon_1000.projects.nitrc.org/indi/adhd200

³ http://humanconnectome.org/
Figure 1: Input, algorithm pipeline, and output of the learning (left) and performance task (right) of
our algorithm.


3.1 Extracting ADHD Regions of Interest 

For each time slice of each fMRI scan, we extract clusters of voxels that correspond to the twelve
ADHD regions of interest from the cortex, cingulate gyrus, and thalamus regions of the brain [?],
according to the Harvard-Oxford Cortical and Subcortical Structural Atlas.⁴ Each ROI is composed
of approximately 4000 voxels; there are 48,710 voxels across all ROIs, which is less than a tenth of
the ≈ 500,000 voxels in one time slice of an fMRI scan. Despite this reduction,
the dimensionality of the feature space is still quite large. To reduce it further,
four dimensionality reduction algorithms, described in the following sections, are
applied to the fMRI voxels extracted from the twelve ADHD regions of interest.

3.2 Feature Dimensionality Reduction

3.2.1 Voxel Means 

A simple method of dimensionality reduction is to compute the average value of
the given feature set. Using this method, we compute the average of the fMRI voxel values in each
region of interest. The output of this dimensionality reduction method is a matrix
$O \in \mathbb{R}^{m \times t \times 12}$
such that m is the number of subject fMRI scans in the dataset, t is the number of image samples
over time, and 12 is the number of ROIs.
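
As a minimal sketch of this step for a single scan, the following assumes each time sample is stored as a flat list of voxel intensities and each ROI as a list of voxel positions; the function name `roi_means` and this data layout are illustrative assumptions, not part of the original implementation.

```python
from statistics import fmean

def roi_means(scan, roi_indices):
    """Average the voxel values inside each ROI for every time sample.

    scan: list of t time samples, each a flat list of voxel intensities.
    roi_indices: one list of voxel positions per ROI (hypothetical layout).
    Returns a t x (number of ROIs) list of lists.
    """
    return [[fmean(sample[v] for v in roi) for roi in roi_indices]
            for sample in scan]

# Toy scan: t = 2 time samples, 6 voxels, 2 ROIs (the paper uses 12 ROIs).
scan = [[1.0, 3.0, 5.0, 2.0, 2.0, 2.0],
        [0.0, 2.0, 4.0, 1.0, 3.0, 5.0]]
rois = [[0, 1, 2], [3, 4, 5]]
features = roi_means(scan, rois)
```

Stacking the result for all m scans gives the m × t × 12 array described above.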

3.2.2 Weighted Voxel Means 

The average value of the voxels ignores the fact that the voxels lying near the center of a region of
interest may contain more information than the voxels at the boundary. Taking this into consideration,
we experimented with computing a weighted mean of voxels for each region of interest. The
weights were derived from a univariate Gaussian distribution with a standard deviation of one and
a mean value that is aligned with the center of that region. Hence, voxels further from the center
will be assigned smaller weights. The output of this dimensionality reduction method is a matrix
$O \in \mathbb{R}^{m \times t \times 12}$ such that m is the number of subject fMRI scans in the dataset, t is the number of
image samples over time, and 12 is the number of ROIs.
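
The weighting scheme can be sketched as follows. The text does not specify how distance to the center is measured, so this example assumes Euclidean distance in voxel coordinates from the ROI centroid and normalizes the weights to sum to one; `gaussian_weighted_mean` is a hypothetical helper name.

```python
import math

def gaussian_weighted_mean(coords, values, sigma=1.0):
    """Weighted mean of ROI voxel values using Gaussian weights.

    coords: list of (x, y, z) voxel coordinates; values: matching intensities.
    Each weight is exp(-d^2 / (2 sigma^2)), where d is the Euclidean distance
    from the ROI centroid (an assumed definition of the region's center).
    """
    n = len(coords)
    centroid = [sum(c[axis] for c in coords) / n for axis in range(3)]
    weights = [math.exp(-math.dist(c, centroid) ** 2 / (2.0 * sigma ** 2))
               for c in coords]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

# Toy ROI of three voxels on a line; the middle voxel is the centroid.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
values = [10.0, 1.0, 10.0]
weighted = gaussian_weighted_mean(coords, values)
plain = sum(values) / len(values)
```

In this toy case the weighted mean lies below the plain mean because the low-valued central voxel receives the largest weight.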

⁴ http://neuro.debian.net/pkgs/fsl-harvard-oxford-atlases.html

3.2.3 Principal Components Analysis

Principal Components Analysis (PCA) [?] applies an orthogonal transformation to an input dataset
to convert a set of possibly correlated variables into a set of linearly uncorrelated variables named
principal components. PCA is applied to the voxels in each region of interest and the three principal
components with the largest spectral components are selected. The term spectral components refers
to the singular values in the singular value decomposition of the input data matrix. Three principal
components were selected because their singular values were substantially larger than the others,
and thus these principal components capture the most information about the input voxel data. The
output of this dimensionality reduction method is a matrix $O \in \mathbb{R}^{m \times t \times 36}$ such that m is the number
of subject fMRI scans in the dataset, t is the number of image samples over time, and 3 × 12 = 36
is the length of a concatenated vector containing the three principal components computed for each
of the twelve regions of interest.
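
The projection onto leading principal components can be illustrated with a small standard-library sketch. Two assumptions are made here: the rows of the data matrix are time samples and the columns are the voxels of one ROI (the text does not state how the matrix is assembled), and power iteration with deflation stands in for the singular value decomposition. The function names are hypothetical.

```python
import math

def principal_directions(rows, k, iters=200):
    """Leading k principal directions of `rows` (samples x features).

    Power iteration with deflation on the scatter matrix; request no more
    directions than the rank of the centered data.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    scatter = [[sum(x[a] * x[b] for x in centered) for b in range(d)]
               for a in range(d)]
    directions = []
    for _component in range(k):
        v = [1.0 / (j + 1) for j in range(d)]   # arbitrary non-zero start
        for _step in range(iters):
            w = [sum(scatter[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:                    # no variance left to explain
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * scatter[a][b] * v[b]
                         for a in range(d) for b in range(d))
        directions.append(v)
        # Deflation: remove the direction just found from the scatter matrix.
        scatter = [[scatter[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
                   for a in range(d)]
    return mean, directions

def project(rows, mean, directions):
    """Coordinates of each row along the given principal directions."""
    return [[sum((r[j] - mean[j]) * v[j] for j in range(len(mean)))
             for v in directions] for r in rows]

# Toy ROI: 4 time samples of 3 voxels (the paper keeps 3 components per ROI).
rows = [(-1.0, -2.0, 0.0), (0.0, 0.0, 1.0), (1.0, 2.0, 0.0), (0.0, 0.0, -1.0)]
mean, directions = principal_directions(rows, k=2)
features = project(rows, mean, directions)
```

Concatenating the per-ROI projections for the twelve regions yields the 36-dimensional feature vector for each time sample.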


3.2.4 Kernel Principal Components Analysis 

Kernel Principal Components Analysis (kPCA) is a non-linear dimensionality reduction technique 
that maps the data to a non-linear feature space that is defined by a kernel function. The mapping 
attempts to unfold the data onto a lower dimensional manifold. In the context of our project, we 
assume that the fMRI voxels in each region of interest lie on a ten dimensional manifold, which 
the kPCA algorithm will attempt to recover. Given an fMRI scan, for each time instance we apply 
kPCA on each of the twelve regions of interest. In order to stretch the underlying lower dimensional 
manifold, the feature space should maximize the distance between two neighboring points while 
keeping the locality constraint intact. Formally, if $x_i$, $x_j$, and $x_k$ are three neighboring points and
their corresponding feature-space representations are $\Phi(x_i)$, $\Phi(x_j)$, and $\Phi(x_k)$, then the problem of
manifold learning becomes

$$\max_{\Phi} \; \lVert \Phi(x_i) - \Phi(x_j) \rVert_2^2, \tag{1}$$

subject to the constraints that
$(\Phi(x_i) - \Phi(x_j))^T (\Phi(x_i) - \Phi(x_k)) = (x_i - x_j)^T (x_i - x_k)$ and
$\sum_i \Phi(x_i) = 0$ [16]. This optimization problem can be efficiently solved by semidefinite
programming techniques. The output of this dimensionality reduction method is a matrix
$O \in \mathbb{R}^{m \times t \times 120}$
such that m is the number of subject fMRI scans in the dataset, t is the number of image samples
over time, and 10 × 12 = 120 is the length of a concatenated vector containing the ten dimensional
subspace computed by kPCA for each of the twelve regions of interest.
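
For reference, problems of this kind are commonly written as a semidefinite program over the Gram matrix rather than over the mapping itself. The following is a sketch under the assumption that the maximum variance unfolding formulation applies; the symbols $K_{ij} = \Phi(x_i)^T \Phi(x_j)$ and $N$ (the number of points) are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
\max_{K} \quad & \operatorname{tr}(K) \\
\text{s.t.} \quad & K \succeq 0, \qquad \sum_{i,j} K_{ij} = 0, \\
& K_{ii} - 2K_{ij} + K_{jj} = \lVert x_i - x_j \rVert_2^2
  \quad \text{for all neighboring } i, j.
\end{aligned}
```

When the embedding is centered, $\operatorname{tr}(K) = \frac{1}{2N} \sum_{i,j} \lVert \Phi(x_i) - \Phi(x_j) \rVert_2^2$, so maximizing the trace maximizes the total pairwise distance between embedded points while the equality constraints preserve local distances.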


3.3 Hidden Markov Model Classifier 

Regardless of the dimensionality reduction algorithm used, the feature output is a matrix
$O \in \mathbb{R}^{m \times t \times n}$ such that m is the number of subject fMRI scans, t is the number of image samples
over time, and n is the size of the feature set after dimensionality reduction. The sequence of fMRI
features of a subject is posed as observations to an HMM, a probabilistic graphical model with the
structure presented in Figure 2. The latent variables (hidden states) of the HMM correspond to the
current mental state of the brain. For example, the mental state of the brain can be interpreted as the
task or thought being focused on at any given time.

For training the model, the fMRI scans in the dataset described in Section 4.1 are partitioned into
groups according to their corresponding class label: healthy (1), ADHD inattentive (2), or ADHD
combined (3). One HMM is trained for each class label. Each HMM $\lambda_i$, $\forall i \in \{1, 2, 3\}$, is initialized
with random values for the initial state distribution, the transition matrix, and the emission
distribution. The emission distribution is a Gaussian mixture model. Preliminary experiments using mean
voxel values for regions revealed that a mixture of five Gaussian distributions yielded the best
results. From the dataset, the model parameters of each HMM were learned using the Baum-Welch
algorithm [?]. The Probabilistic Modeling Toolbox for Matlab was used for model initialization,
training, and classification.⁵

Given an fMRI scan of a single subject $O \in \mathbb{R}^{t \times n}$, ADHD classification is performed by returning
the model that maximizes the probability of the observations O,

$$\operatorname*{arg\,max}_{i \in \{1, 2, 3\}} P(O \mid \lambda_i), \tag{2}$$

where $P(O \mid \lambda_i)$ is computed by summing the forward variables in the Forward-Backward Procedure,
which Rabiner [?] describes in detail.
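
The decision rule can be sketched in a few lines. This example is a simplification under stated assumptions: observations are scalars rather than n-dimensional feature vectors, each state emits from a single Gaussian rather than a Gaussian mixture, and the model parameters are hand-picked toy values rather than learned with Baum-Welch. It uses the scaled forward recursion and compares log-likelihoods, which gives the same arg max as comparing likelihoods.

```python
import math

def gaussian_pdf(mean, std):
    """Return a 1-D Gaussian density function (stand-in for a GMM emission)."""
    def pdf(x):
        z = (x - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return pdf

def log_likelihood(obs, start, trans, emissions):
    """log P(obs | model) via the scaled forward recursion.

    start[s]: initial probability of state s; trans[r][s]: P(s | r);
    emissions[s]: density function for observations emitted in state s.
    """
    n = len(start)
    alpha = [start[s] * emissions[s](obs[0]) for s in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n))
                     * emissions[s](obs[t]) for s in range(n)]
        scale = sum(alpha)                  # P(o_t | o_1..o_{t-1})
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return total

def classify(obs, models):
    """Return the label whose HMM gives the observations the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))

# Hypothetical two-state models; the paper trains one HMM per class label.
models = {
    "healthy": ([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]],
                [gaussian_pdf(0.0, 1.0), gaussian_pdf(1.0, 1.0)]),
    "ADHD-i": ([0.5, 0.5], [[0.6, 0.4], [0.5, 0.5]],
               [gaussian_pdf(4.0, 1.0), gaussian_pdf(5.0, 1.0)]),
}
label = classify([0.2, 0.7, 0.1, 0.9], models)
```

Working in the log domain matters in practice because the product of 91 per-sample probabilities would otherwise underflow.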


⁵ http://github.com/probml/pmtk3

Figure 2: Structure of the hidden Markov model for ADHD classification. Observations of the
model are the different feature representations of the voxels in the regions of interest for ADHD.
States of the model correspond to the current mental state of the brain.


4 Algorithm Evaluation 


This section will describe the fMRI dataset used to train and evaluate the proposed temporal ADHD 
diagnosis model. Several hypotheses are made regarding the performance and structure of the model. 
To confirm or refute these conjectures, several experiments are proposed. 


4.1 Dataset 

The ADHD-200 competition dataset is used to train and evaluate our ADHD diagnosis model. The 
dataset consists of 940 fMRI scans of subjects that are labeled by health-care professionals as healthy 
control, ADHD inattentive, ADHD impulsive, or ADHD combined. Only 39 fMRI scans are labeled 
as ADHD impulsive, so these scans were removed from the dataset, leaving 901 scans. There are 91
image samples over time for each subject fMRI scan. Therefore, the dataset is of the form
$O \in \mathbb{R}^{901 \times 91 \times n}$, where
the raw fMRI scans contain n = 510,340 voxels per time sample.

The ADHD-200 competition dataset preprocesses the raw fMRI scans using the following
transformations: motion correction, which compensates for patient head movement; co-registration, which
projects each time slice of an fMRI scan to a standard structural MRI scan space; normalization,
which transforms the size and shape of each fMRI scan to have the same dimensions; temporal
filtering, which removes drift or noise from the time-series data; and intensity normalization, which
normalizes the voxel values to lie within the range [0, 2].


4.2 Hypotheses 

We propose several hypotheses regarding our diagnosis model: 

(i) Increasing the model complexity (number of hidden states in the HMM) will increase
classification accuracy.

(ii) In terms of the dimensionality reduction algorithm used, the classification accuracy of the
proposed diagnosis system using Kernel PCA features will be greater than PCA, which will be
greater than weighted voxel means, which in turn will be greater than voxel means. The
rationale for this hypothesis is that taking the mean or weighted mean value of voxels in each
of the twelve ROIs naively reduces the dimensionality of the feature space such that
discriminative features for proper classification are discarded. We hypothesize that the alternative
dimensionality reduction algorithms will preserve discriminative features.

(iii) After training, the trace of the state transition matrix for the healthy HMM $\lambda_1$ is higher than
the traces of the state transition matrices for the ADHD-i HMM $\lambda_2$ and the ADHD-c HMM
$\lambda_3$. Recall that the trace operator of a matrix sums the diagonal entries of the matrix, which
are the probabilities that the mental state of the subject stays the same. The rationale for
this conjecture is that healthy subjects will tend to stay in the same mental state in contrast to
subjects that are ADHD positive.
4.3 Evaluation Method


We use 5-fold cross validation to evaluate our implemented diagnosis algorithm. The fMRI dataset
is partitioned into five subsets, such that in each iteration 4/5 of the dataset is used for training
and 1/5 of the dataset is used for testing. A different subset is used for testing in each iteration.
Care is taken to partition the dataset such that each subset is populated with an equal distribution of
class labels, i.e., each of the five subsets contains an equal proportion of fMRI data corresponding
to healthy, ADHD-i, and ADHD-c types. When testing on the k-th subset of data, our classifier will
output a class label $\hat{y} \in \{\text{healthy}, \text{ADHD-i}, \text{ADHD-c}\}$ for each of the $m_k$ fMRI scans. Given the
ground-truth labels $y$, the accuracy of our classifier on fold k is

$$\text{accuracy}(k) = \frac{1}{m_k} \sum_{j=1}^{m_k} \mathbb{1}[\hat{y}_j = y_j]. \tag{3}$$

The final classification accuracy is reported as the average of the classification accuracies computed
for each of the five cross-validation folds [?].
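
The evaluation procedure can be sketched as follows. This is an illustration rather than the original Matlab code: the round-robin assignment is one simple way to obtain folds with near-equal class proportions, and `train_and_predict` is a hypothetical callback that trains on the given training indices and returns predicted labels for the test indices.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Deal sample indices into k folds class by class (round-robin), so that
    every fold receives a near-equal share of each label."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def accuracy(predicted, truth):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def cross_validate(labels, train_and_predict, k=5):
    """Mean per-fold accuracy over k stratified folds."""
    scores = []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predicted = train_and_predict(train_idx, test_idx)
        scores.append(accuracy(predicted, [labels[i] for i in test_idx]))
    return sum(scores) / k

# Toy labels and a baseline that always predicts the majority class.
labels = ["healthy"] * 10 + ["ADHD-i"] * 5 + ["ADHD-c"] * 5
folds = stratified_folds(labels)
baseline = cross_validate(labels, lambda train, test: ["healthy"] * len(test))
```

A majority-class baseline of this kind is a useful reference point when the class distribution is imbalanced.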

To confirm or refute the hypotheses posed in the previous section, several experiments have been
designed. To test hypothesis (i), for each dimensionality reduction method proposed, 5-fold cross
validation will be conducted for the following numbers of HMM states: [4, 8, 12, 16, 20]. In this case,
each of the three HMMs will have the same number of states. Hypothesis (ii) can be accepted or rejected
based on the results reported by the experiment proposed for hypothesis (i). To test hypothesis
(iii), the model using the dimensionality reduction method and number of states that received the
highest accuracy in experiment (i) will be trained on the entire dataset, and the trace of
the transition matrix for the healthy HMM will be compared to the traces of the transition matrices for
the ADHD-i and ADHD-c HMMs.

5 Results 

The results of the experiment proposed for hypotheses (i) and (ii) are presented in Table 1, which
displays the 5-fold cross-validation accuracy for different numbers of hidden states in the HMM as
well as different fMRI dimensionality reduction methods.

Our first hypothesis states that classification accuracy will increase as the model complexity
increases. Analyzing the 5-fold cross-validation accuracy of the proposed model for different numbers
of hidden states in the HMMs (Table 1) reveals that our first hypothesis is not quite correct. In the
case of the voxel mean, weighted voxel mean, and kernel PCA dimensionality reduction methods,
we see an increase in cross-validation accuracy as the model complexity increases until a point where
further increases in complexity actually hinder classification performance. In the case of PCA, we
see minute fluctuations in the classification accuracy as model complexity increases; however, there
is no apparent trend. This phenomenon suggests that when processing PCA features of fMRI scans,
classification of ADHD may be independent of the internal hidden state of the HMM and alternative
temporal models should be explored in this case.

For each dimensionality reduction method, if we consider the number of states that yields
the maximum cross-validation classification accuracy (the largest entry in each column of Table 1),
our hypothesis that the classification accuracy of kernel PCA will exceed PCA, which will exceed
weighted voxel means, which will exceed the accuracy of voxel means, is almost correct. The results
show that the classification accuracy of PCA and kPCA dimensionality reduction is indeed greater than
that of weighted voxel means, which is greater than that of voxel means. However, the accuracies of
PCA and kPCA are almost equivalent and outperform state-of-the-art diagnosis systems on the ADHD-200
dataset [10] [?]. Hence, we conclude that for PCA, the three components along the maximum variability of
the data successfully capture a significant amount of the information necessary for classifying ADHD.
Our assumption that the fMRI data for each subject and for each time instance lies on a lower
dimensional manifold also holds for the purposes of classification.

Table 1: 5-fold cross-validation results for ADHD classification (healthy, ADHD-i, or ADHD-c)
using various numbers of HMM states for the following feature representations: voxel means,
weighted voxel means, PCA, and kPCA.

Number of States   Voxel Means   Weighted Voxel Means   PCA*     kPCA†
4                  44.86%        48.55%                 62.80%   60.22%
8                  49.80%        54.16%                 62.26%   61.07%
12                 51.58%        52.34%                 61.84%   61.50%
16                 53.89%        49.27%                 62.15%   62.06%
20                 48.49%        47.64%                 63.01%   60.62%


Table 2: Trace of the transition matrices for the healthy control, ADHD-i, and ADHD-c HMM with
20 hidden states, when trained using the entire dataset of fMRIs that were reduced in dimensionality
using kernel PCA.

HMM       Transition Matrix Trace
Healthy   1.9496
ADHD-i    1.9490
ADHD-c    1.9484


Our third hypothesis states that when training our classifier with 20 hidden states for each of the three
HMMs on the entire fMRI dataset using PCA dimensionality reduction, the trace of the transition
matrix for the healthy HMM will be larger than the traces of the transition matrices for the ADHD-i
and ADHD-c HMMs. The results of this experiment are presented in Table 2. Though the results
show that our hypothesis is in fact correct, the differences in trace are too small to draw any
concrete conclusions. Perhaps the most interesting question that arises from this experiment is why
the traces of the transition matrices are so small. Note that a trace value of two on a transition matrix
of the form $A \in [0, 1]^{20 \times 20}$ implies that the average self-state transition probability in the HMM is
approximately 2/20 = 0.1. This suggests that most of the transitions in the underlying Markov chain may be
from one state to another instead of from one state to itself.

Upon further investigation of the transition matrices, it seems the opposite is true. In all cases, i.e.,
for all numbers of hidden states, and for all dimensionality reduction algorithms explored, one state
in the Markov chain has a high self-state transition probability on the order of 0.99.
That is, 99% of the time, the Markov chain will remain in this state once it wanders onto the state.
This means that a single state is essentially attempting to explain every single time slice of the fMRI,
which is not the desired effect of using such a temporal model.
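
The two observations are compatible, as the following arithmetic sketch shows. It takes the healthy-model trace reported in Table 2 and the approximate 0.99 self-transition probability from the text; the transition matrix built here is purely illustrative (uniform off-diagonal entries are an assumption), not the learned matrix.

```python
def trace(matrix):
    """Sum of the diagonal entries of a square matrix."""
    return sum(matrix[i][i] for i in range(len(matrix)))

n_states = 20
table2_trace = 1.9496                    # healthy HMM, Table 2
dominant = 0.99                          # approximate value given in the text

mean_self = table2_trace / n_states      # average diagonal entry
mean_other = (table2_trace - dominant) / (n_states - 1)
expected_dwell = 1.0 / (1.0 - dominant)  # mean of a geometric holding time

# Illustrative matrix: one sticky state, 19 states with small self-transitions,
# and the remaining probability mass spread uniformly over the other states.
diagonal = [dominant] + [mean_other] * (n_states - 1)
A = [[diagonal[r] if r == c else (1.0 - diagonal[r]) / (n_states - 1)
      for c in range(n_states)] for r in range(n_states)]
```

Under these assumptions the remaining 19 states have an average self-transition probability of roughly 0.05, and the dominant state is held for about 100 consecutive samples on average, which is longer than the 91 time samples in a scan.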

Although the results of our proposed temporal ADHD classification model are comparable to
state-of-the-art ADHD classifiers published in the literature [?], our system barely performs above the
baseline of labeling every input data instance as the majority class in the training dataset. The
majority class is the healthy subject type and amounts to 62.13% of the dataset. This is compared
to our best result of 63.01% when using 20 hidden states and principal components analysis for
dimensionality reduction of the fMRI.

There are several explanations for the lackluster performance of the proposed classifier, some of 
which pertain to the performance of ADHD classification systems in general. Perhaps the difference 
between the resting state brain activity for individuals who are ADHD positive and ADHD negative 
is negligible. It could also be the case that the regions of interest extracted from the resting-state 
fMRI do not contain discriminative information for diagnosing ADHD and other regions of interest 
should be explored. Another possible explanation is that the sampling frequency of the physical 
fMRI machine does not coincide with the frequency with which the brain switches mental processes 
and our temporal model is not synchronized with the real mental state transitions of the brain. 

* Using 9 Gaussian mixtures instead of 5.

† Using 16 Gaussian mixtures instead of 5.

It could also be that the dimensionality reduction algorithms that we explored in this paper severely
degrade the discriminative power of the raw voxel intensities. On the other hand, it could be that
the dimensionality reduction methods still yield too many features and feature selection algorithms
should be applied to the observations posed to the HMMs. To see if this is the case, we performed
pairwise t-tests on the voxel mean features for the twelve regions of interest for ADHD and found
that regions 4 (Intra-Calcarine Cortex), 5 (Frontal Medial Cortex), 9 (Cingulate Gyrus,
posterior division), and 10 (Frontal Orbital Cortex) do not exhibit any discriminative power between
healthy controls and ADHD positive subjects. That is, the null hypothesis that these features do not
differ between the groups could not be rejected for these regions of interest.
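
A per-region test of this kind can be sketched as follows. The text does not state which variant of the t-test was used, so this example assumes Welch's unequal-variance statistic, and the feature values are made up for illustration. Only the test statistic is computed; converting it to a p-value requires the t distribution, which is outside the standard library.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def per_roi_t(group_a, group_b):
    """One t statistic per ROI.

    group_a, group_b: lists of per-subject feature vectors, one value per ROI.
    """
    return [welch_t(col_a, col_b)
            for col_a, col_b in zip(zip(*group_a), zip(*group_b))]

# Hypothetical voxel-mean features for two ROIs and three subjects per group.
healthy = [[1.0, 5.0], [2.0, 5.5], [3.0, 4.5]]
adhd = [[2.0, 5.0], [3.0, 5.5], [4.0, 4.5]]
t_values = per_roi_t(healthy, adhd)
```

An ROI whose statistic is close to zero, like the second one here, shows no evidence of a difference in mean activation between the groups.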

6 Conclusion 

The development of automatic ADHD diagnostic algorithms from fMRI data is a challenging task. 
The application of statistical pattern recognition algorithms to this problem currently yields
insubstantial results, rendering these classification systems unfit for practice in the health-care industry.
However, much research is being done to improve these results and search for discriminative features 
for classifying ADHD amongst the plethora of voxel values present in a single fMRI scan. 

Unlike systems that aggregate fMR images over time to produce a single three-dimensional
image of the brain that is then used for classification, in this paper we explored a temporal
classification system that uses the evolution of voxel values over time to make decisions regarding ADHD
diagnosis. Specifically, we used the ADHD-200 competition dataset to learn an HMM classifier
that classifies the subject as healthy, ADHD inattentive type, or ADHD combined type from an
input fMRI scan. We investigated and evaluated the application of several dimensionality
reduction algorithms to twelve ADHD regions of interest in the fMRI scan. Our results indicate that a
hidden Markov model with 20 hidden states processing fMRI data, where the dimensionality is
reduced by the principal components analysis algorithm, yields the best results with 63.01% 5-fold
cross-validation accuracy.

Still, there is much work to be done in this area. Our analysis of the transition matrices of the trained
hidden Markov models indicates that other temporal models, such as recurrent neural networks or
hidden Markov models with more arcs between nodes, should be explored in future work. Moreover,
we propose that locating lower dimensional sets of fMRI features that retain discriminative power
for ADHD classification is the heart of the problem and that the majority of future work should focus
on this task.